Prague is one of the most popular city in Europe with many outstanding views and historic, exciting structures. My journey of Prague has started with the hiring of my wife by a global company there. In this study, I give a brief insight about the population, house prices and types and compare venues in different municipal parts of the city for anyone looking for buying a property in Prague or interested in real estate industry.
The data is used in this project can be divided into three categories.
-Population by Municipal Parts: I took the table from ... as csv.
-House sales price: I scrape all available advertisements for Prague in SReality.com which is one the most used webpage for property sale. (around 4700 objects)
-Venues data in municipal parts: I used Foursquare API for getting the data.
Here is the population data for 57 Municipal Parts of Prague.
#Import libraries
import pandas as pd
import numpy as np
df_pra_pop = pd.read_csv("Prague_population.csv")
df_pra_pop
| Districts | Population | |
|---|---|---|
| 0 | Praha 1 | 29563 |
| 1 | Praha 2 | 50363 |
| 2 | Praha 3 | 76041 |
| 3 | Praha 4 | 132068 |
| 4 | Praha 5 | 88225 |
| 5 | Praha 6 | 107213 |
| 6 | Praha 7 | 45226 |
| 7 | Praha 8 | 106218 |
| 8 | Praha 9 | 60601 |
| 9 | Praha 10 | 110571 |
| 10 | Praha 11 | 77324 |
| 11 | Praha 12 | 57821 |
| 12 | Praha 13 | 63554 |
| 13 | Praha 14 | 47761 |
| 14 | Praha 15 | 34720 |
| 15 | Praha 16 | 8694 |
| 16 | Praha 17 | 24075 |
| 17 | Praha 18 | 21516 |
| 18 | Praha 19 | 7326 |
| 19 | Praha 20 | 15652 |
| 20 | Praha 21 | 10860 |
| 21 | Praha 22 | 12559 |
| 22 | Bechovice | 2733 |
| 23 | Benice | 704 |
| 24 | Brezineves | 1754 |
| 25 | Cakovice | 11868 |
| 26 | Dablice | 3647 |
| 27 | Dolni Chabry | 4801 |
| 28 | Dolni Mecholupy | 3314 |
| 29 | Dolni Pocernice | 2709 |
| 30 | Dubec | 4023 |
| 31 | Klanovice | 3757 |
| 32 | Kolodeje | 1580 |
| 33 | Kolovraty | 3912 |
| 34 | Kralovice | 407 |
| 35 | Kreslice | 1117 |
| 36 | Kunratice | 10023 |
| 37 | Libus | 10623 |
| 38 | Lipence | 2911 |
| 39 | Lochkov | 828 |
| 40 | Lysolaje | 1509 |
| 41 | Nebusice | 3372 |
| 42 | Nedvezi | 333 |
| 43 | Petrovice | 6035 |
| 44 | Predni Kopanina | 693 |
| 45 | Reporyje | 5205 |
| 46 | Satalice | 2551 |
| 47 | Seberov | 3309 |
| 48 | Slivenec | 3696 |
| 49 | Sterboholy | 2367 |
| 50 | Suchdol | 7225 |
| 51 | Troja | 1426 |
| 52 | Ujezd | 3498 |
| 53 | Velka Chuchle | 2532 |
| 54 | Vinor | 4557 |
| 55 | Zbraslav | 10049 |
| 56 | Zlicin | 7258 |
Sreality.com is one of the most used advirtisement webpage for house sale. I scraped the data by using chromedriver, and renamed and dropped some of the columns. This part took high amounth time. At the end I have the data frame below. If you want to look how I scraped the data from the page you may look Scraping.ipynb.
-Resource: https://www.sreality.cz/
Here I would like to decribe type of the houses in Czech Republic. The naming could be different than other countries.
df=pd.read_csv('prague_house_prices_v1.csv')
df.head()
| Title | Type | Area | Latitude | Longitude | Coordinates | Price(CZK) | |
|---|---|---|---|---|---|---|---|
| 0 | Byt 1+kk 43 m² k prodeji Hugo Haase, Praha 5 -... | 1+kk | 43 | 50.0290 | 14.3622 | (50.029, 14.3623) | 4274096 |
| 1 | Byt 3+kk 75 m² k prodeji Na Veselí, Praha 4 - ... | 3+kk | 75 | 50.0560 | 14.4414 | (50.056, 14.4414) | 7990000 |
| 2 | Byt 2+1 71 m² k prodeji Rooseveltova, Praha 6 ... | 2+1 | 71 | 50.1036 | 14.3994 | (50.1036, 14.3995) | 8900000 |
| 3 | Byt 1+kk 27 m² k prodeji Střešovická, Praha 6 ... | 1+kk | 27 | 50.0930 | 14.3773 | (50.0931, 14.3773) | 3299000 |
| 4 | Byt 4+kk 100 m² k prodeji 5. května, Praha 4 -... | 4+kk | 100 | 50.0610 | 14.4318 | (50.061, 14.4318) | 9490000 |
I used Foursquare API for getting the data.
In this section, I created a foluim map to observe Prague Municipal Parts and created a gif for my blog.
#import libs
import wget
import folium
#json
prague_regions_map = r'TMMESTSKECASTI_P1.json' # geojson file
# create a plain Prague map
prague_regions = folium.Map(location=[50.06, 14.45], zoom_start=11)
prague_regions
You can scroll around map to observe municipal parts.
#dumy data for region gif
df_pop_new=df_pra_pop
df_pop_new['Population'] =5
# create a plain world map
communities_map = folium.Map(location=[50.06, 14.45], zoom_start=11, tiles='stamenwatercolor')
# generate choropleth map
choropleth = folium.Choropleth(
geo_data=prague_regions_map,
data=df_pop_new,
columns=['Districts', 'Population'],
key_on='feature.properties.NAZEV_1',
fill_color='YlOrBr',
fill_opacity=1,
line_opacity=1,
highlight=True,
smooth_factor=0).add_to(communities_map)
# add labels indicating the name of the community
style_function = "font-size: 15px; font-weight: bold"
choropleth.geojson.add_child(
folium.features.GeoJsonTooltip(['NAZEV_1'], style=style_function, labels=False))
communities_map.save('map_for_gif.html')
# display map
communities_map
#retrieve data
df_pra_pop = pd.read_csv("Prague_population.csv")
prague_regions_map = r'TMMESTSKECASTI_P1.json'
# let Folium determine the scale.
prague_regions = folium.Map(location=[50.06, 14.45], zoom_start=11, tiles='stamenwatercolor')
prague_regions.choropleth(
geo_data=prague_regions_map,
data=df_pra_pop,
columns=['Districts', 'Population'],
key_on='feature.properties.NAZEV_1',
fill_opacity=0.9,
line_opacity=0.5,
legend_name='Population in Prague',
highlight=True,
reset=True)
prague_regions
c:\users\taner\appdata\local\programs\python\python38\lib\site-packages\folium\folium.py:411: FutureWarning: The choropleth method has been deprecated. Instead use the new Choropleth class, which has the same arguments. See the example notebook 'GeoJSON_and_choropleth' for how to do this. warnings.warn(
df=pd.read_csv('prague_house_prices_v1.csv')
df.head()
| Title | Type | Area | Latitude | Longitude | Coordinates | Price(CZK) | |
|---|---|---|---|---|---|---|---|
| 0 | Byt 1+kk 43 m² k prodeji Hugo Haase, Praha 5 -... | 1+kk | 43 | 50.0290 | 14.3622 | (50.029, 14.3623) | 4274096 |
| 1 | Byt 3+kk 75 m² k prodeji Na Veselí, Praha 4 - ... | 3+kk | 75 | 50.0560 | 14.4414 | (50.056, 14.4414) | 7990000 |
| 2 | Byt 2+1 71 m² k prodeji Rooseveltova, Praha 6 ... | 2+1 | 71 | 50.1036 | 14.3994 | (50.1036, 14.3995) | 8900000 |
| 3 | Byt 1+kk 27 m² k prodeji Střešovická, Praha 6 ... | 1+kk | 27 | 50.0930 | 14.3773 | (50.0931, 14.3773) | 3299000 |
| 4 | Byt 4+kk 100 m² k prodeji 5. května, Praha 4 -... | 4+kk | 100 | 50.0610 | 14.4318 | (50.061, 14.4318) | 9490000 |
df['Type'].value_counts()
2+kk 1212 3+kk 984 1+kk 625 3+1 444 4+kk 434 2+1 271 4+1 136 5+kk 104 1+1 93 atypicky 35 5+1 31 6-a-vice 21 Name: Type, dtype: int64
df['Latitude'] = pd.to_numeric(df['Latitude'],errors='coerce')
df['Longitude'] = pd.to_numeric(df['Longitude'],errors='coerce')
df.dtypes
Title object Type object Area int64 Latitude float64 Longitude float64 Coordinates object Price(CZK) int64 dtype: object
df.head(20)
df.shape
(4390, 7)
df['Latitude'].isnull().sum()
4
df['Longitude'].isnull().sum()
2
#find missing values
df[df.isnull().any(axis=1)]
| Title | Type | Area | Latitude | Longitude | Coordinates | Price(CZK) | |
|---|---|---|---|---|---|---|---|
| 567 | Byt 1+kk 32 m² k prodeji Františka Jansy, Prah... | 1+kk | 32 | NaN | 14.5527 | (50.065, 14.5527) | 1199000 |
| 1917 | Byt 2+kk 101 m² k prodeji Na Ořechovce, Praha ... | 2+kk | 101 | 50.0953 | NaN | (50.0953, 14.381) | 11990000 |
| 2077 | Byt 3+kk 93 m² k prodeji Františka Jansy, Prah... | 3+kk | 93 | NaN | 14.5527 | (50.065, 14.5527) | 2850000 |
| 2434 | Byt 3+kk 101 m² k prodeji Na Ořechovce, Praha ... | 3+kk | 101 | 50.0953 | NaN | (50.0953, 14.381) | 11990000 |
| 2480 | Byt 1+kk 27 m² k prodeji Bajkonurská, Praha 4 ... | 1+kk | 27 | NaN | 14.5286 | (50.03, 14.5286) | 2690000 |
| 3669 | Byt 1+kk 30 m² k prodeji Jurkovičova, Praha 4 ... | 1+kk | 30 | NaN | 14.5296 | (50.027, 14.5296) | 3350000 |
#Correct values
df.iloc[567,3] = 50.065
df.iloc[1917,4] = 14.381
df.iloc[2077,3] = 50.065
df.iloc[2434,4] = 14.381
df.iloc[2480,3] = 50.03
df.iloc[3669,3] = 50.027
#find missing values
df[df.isnull().any(axis=1)]
| Title | Type | Area | Latitude | Longitude | Coordinates | Price(CZK) |
|---|
#!pip install shapely
import shapely
import geojson
import json
from shapely.geometry import shape, GeometryCollection, Point
region_list=[]
with open('TMMESTSKECASTI_P1.json', 'r') as f:
js = json.load(f)
for i in range(0,4390):
x= df['Longitude']
y= df['Latitude']
x=x.iloc[i]
y=y.iloc[i]
point = Point(x, y)
for feature in js['features']:
polygon = shape(feature['geometry'])
if polygon.contains(point):
region_list.append(str(feature['properties']['NAZEV_1']))
df_region=pd.DataFrame(region_list)
df_region.shape
(4390, 1)
df['Region'] =df_region
df.head()
| Title | Type | Area | Latitude | Longitude | Coordinates | Price(CZK) | Region | |
|---|---|---|---|---|---|---|---|---|
| 0 | Byt 1+kk 43 m² k prodeji Hugo Haase, Praha 5 -... | 1+kk | 43 | 50.0290 | 14.3622 | (50.029, 14.3623) | 4274096 | Praha 5 |
| 1 | Byt 3+kk 75 m² k prodeji Na Veselí, Praha 4 - ... | 3+kk | 75 | 50.0560 | 14.4414 | (50.056, 14.4414) | 7990000 | Praha 4 |
| 2 | Byt 2+1 71 m² k prodeji Rooseveltova, Praha 6 ... | 2+1 | 71 | 50.1036 | 14.3994 | (50.1036, 14.3995) | 8900000 | Praha 6 |
| 3 | Byt 1+kk 27 m² k prodeji Střešovická, Praha 6 ... | 1+kk | 27 | 50.0930 | 14.3773 | (50.0931, 14.3773) | 3299000 | Praha 6 |
| 4 | Byt 4+kk 100 m² k prodeji 5. května, Praha 4 -... | 4+kk | 100 | 50.0610 | 14.4318 | (50.061, 14.4318) | 9490000 | Praha 4 |
df.tail()
| Title | Type | Area | Latitude | Longitude | Coordinates | Price(CZK) | Region | |
|---|---|---|---|---|---|---|---|---|
| 4385 | Byt 2+kk 58 m² k prodeji Tlumačovská, Praha 5 ... | 2+kk | 58 | 50.0399 | 14.3357 | (50.0399, 14.3358) | 5850000 | Praha 13 |
| 4386 | Byt 4+kk 126 m² k prodeji Praha 5; 9 989 000 K... | 4+kk | 126 | 50.0835 | 14.4341 | (50.0835, 14.4341) | 9989000 | Praha 1 |
| 4387 | Byt 1+kk 35 m² k prodeji Olšanská, Praha 3 - Ž... | 1+kk | 35 | 50.0833 | 14.4662 | (50.0834, 14.4662) | 4290000 | Praha 3 |
| 4388 | Byt 4+kk 140 m² k prodeji Na pomezí, Praha 5 -... | 4+kk | 140 | 50.0614 | 14.3663 | (50.0615, 14.3663) | 21679000 | Praha 5 |
| 4389 | Byt 1+kk 39 m² k prodeji Grafická, Praha 5 - S... | 1+kk | 39 | 50.0730 | 14.3945 | (50.0731, 14.3945) | 5990000 | Praha 5 |
df.drop(['Title'], axis=1, inplace=True)
df.head()
| Type | Area | Latitude | Longitude | Coordinates | Price(CZK) | Region | |
|---|---|---|---|---|---|---|---|
| 0 | 1+kk | 43 | 50.0290 | 14.3622 | (50.029, 14.3623) | 4274096 | Praha 5 |
| 1 | 3+kk | 75 | 50.0560 | 14.4414 | (50.056, 14.4414) | 7990000 | Praha 4 |
| 2 | 2+1 | 71 | 50.1036 | 14.3994 | (50.1036, 14.3995) | 8900000 | Praha 6 |
| 3 | 1+kk | 27 | 50.0930 | 14.3773 | (50.0931, 14.3773) | 3299000 | Praha 6 |
| 4 | 4+kk | 100 | 50.0610 | 14.4318 | (50.061, 14.4318) | 9490000 | Praha 4 |
#export
df.to_csv('Prague_houses.csv', index=False)
#exclude the regions have objects less than 10
s = df['Region'].value_counts()
df_new = df[df.isin(s.index[s >= 17]).values]
#Average price by region
df_grouped=df_new.groupby('Region').mean()
df_grouped_price=df_grouped[['Price(CZK)']]
df_grouped_area=df_grouped[['Area']]
#sort
df_grouped_price.sort_values(['Price(CZK)'], ascending=False, axis=0, inplace=True)
df_grouped_area.sort_values(['Area'], ascending=False, axis=0, inplace=True)
#get the top 7 entries
df_price_top6 = df_grouped_price.head(6)
df_area_top6 = df_grouped_area.head(6)
<ipython-input-33-a0c47cd52c7a>:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_grouped_price.sort_values(['Price(CZK)'], ascending=False, axis=0, inplace=True) <ipython-input-33-a0c47cd52c7a>:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_grouped_area.sort_values(['Area'], ascending=False, axis=0, inplace=True)
# use the inline backend to generate the plots within the browser
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.style.use('ggplot') # optional: for ggplot-like style
# check for latest version of Matplotlib
print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0
Matplotlib version: 3.3.2
# step 2: plot data
df_price_top6.plot(kind='bar', figsize=(10, 6))
plt.xlabel('Region') # add to x-label to the plot
plt.ylabel("Price in million CZK") # add y-label to the plot
plt.title('Average House Prices by Regions in Prague') # add title to the plot
#Axes.ticklabel_format(self, *, axis='both', style='', scilimits=None, useOffset=None, useLocale=None, useMathText=None)
plt.ticklabel_format(scilimits=(6,6), axis='y')
plt.show()
# step 2: plot data
df_area_top6.plot(kind='bar', figsize=(10, 6), color='orange')
plt.xlabel('Region') # add to x-label to the plot
plt.ylabel("Area per house (m2)") # add y-label to the plot
plt.title('Average Area per House in Prague Regions ') # add title to the plot
plt.show()
df_grouped_type=df.groupby('Type').count()
df_type=df_grouped_type[['Area']]
df_type.rename(columns={'Area' : 'Count'}, inplace=True)
c:\users\taner\appdata\local\programs\python\python38\lib\site-packages\pandas\core\frame.py:4300: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy return super().rename(
df_type.sort_values(['Count'], ascending=False, axis=0, inplace=True)
df_type=df_type.head(5)
<ipython-input-38-76b43bdfd13d>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_type.sort_values(['Count'], ascending=False, axis=0, inplace=True)
# step 2: plot data
df_type.plot(kind='bar', figsize=(10, 6), color='green')
plt.xlabel('Type of the House') # add to x-label to the plot
plt.ylabel("Number of House") # add y-label to the plot
plt.title('Number of House by Type in Prague') # add title to the plot
plt.show()
df_incidents = df.sort_values(['Price(CZK)'], ascending=False, axis=0, inplace=False).head(20)
df_incidents
| Type | Area | Latitude | Longitude | Coordinates | Price(CZK) | Region | |
|---|---|---|---|---|---|---|---|
| 290 | 5+kk | 235 | 50.0816 | 14.4137 | (50.0817, 14.4138) | 79000000 | Praha 1 |
| 27 | 5+kk | 235 | 50.0833 | 14.4139 | (50.0833, 14.4139) | 79000000 | Praha 1 |
| 3358 | 5+kk | 235 | 50.0816 | 14.4137 | (50.0817, 14.4138) | 79000000 | Praha 1 |
| 1142 | atypicky | 445 | 50.0865 | 14.4323 | (50.0866, 14.4323) | 59870000 | Praha 1 |
| 3398 | 4+kk | 255 | 50.0903 | 14.4216 | (50.0904, 14.4217) | 54450000 | Praha 1 |
| 358 | 4+kk | 229 | 50.0845 | 14.4177 | (50.0846, 14.4178) | 50000000 | Praha 1 |
| 3270 | 5+kk | 302 | 50.0614 | 14.3663 | (50.0615, 14.3663) | 49849000 | Praha 5 |
| 2889 | 4+kk | 169 | 50.0891 | 14.4183 | (50.0892, 14.4184) | 49000000 | Praha 1 |
| 1769 | 6-a-vice | 243 | 50.0730 | 14.3889 | (50.0731, 14.389) | 45990000 | Praha 5 |
| 942 | 4+kk | 315 | 50.0845 | 14.4177 | (50.0846, 14.4178) | 45000000 | Praha 1 |
| 1969 | 4+kk | 256 | 50.0845 | 14.4177 | (50.0846, 14.4178) | 45000000 | Praha 1 |
| 1068 | 6-a-vice | 318 | 50.0092 | 14.3858 | (50.0093, 14.3858) | 42025000 | Velka Chuchle |
| 591 | 3+1 | 329 | 50.0865 | 14.4323 | (50.0866, 14.4323) | 41928000 | Praha 1 |
| 974 | 4+kk | 329 | 50.0865 | 14.4323 | (50.0866, 14.4323) | 41928000 | Praha 1 |
| 3483 | atypicky | 388 | 50.0828 | 14.3764 | (50.0828, 14.3765) | 39900000 | Praha 6 |
| 2605 | 6-a-vice | 241 | 50.0735 | 14.4629 | (50.0735, 14.463) | 39890000 | Praha 10 |
| 2877 | 3+1 | 135 | 50.0878 | 14.4153 | (50.0878, 14.4153) | 38000000 | Praha 1 |
| 1750 | 4+1 | 226 | 50.0981 | 14.4706 | (50.0981, 14.4707) | 37990000 | Praha 8 |
| 4139 | 5+kk | 209 | 50.0770 | 14.3876 | (50.0771, 14.3877) | 37550000 | Praha 5 |
| 1967 | 4+kk | 211 | 50.0795 | 14.4139 | (50.0795, 14.414) | 37030007 | Praha 1 |
prague_regions_v2 = folium.Map(location=[50.06, 14.45], zoom_start=11, tiles='stamenwatercolor')
prague_regions_v2.choropleth(
geo_data=prague_regions_map,
key_on='feature.properties.NAZEV_1',
fill_opacity=0.9,
line_opacity=0.5,
highlight=True,
fill_color='yellow',
legend_name=None,
reset=True)
incidents = folium.map.FeatureGroup()
# loop through the 100 crimes and add each to the incidents feature group
for lat, lng, in zip(df_incidents.Latitude, df_incidents.Longitude):
incidents.add_child(
folium.CircleMarker(
[lat, lng],
radius=5, # define how big you want the circle markers to be
color='red',
fill=True,
fill_color=None,
fill_opacity=0.3
)
)
# add incidents to map
prague_regions_v2.add_child(incidents)
c:\users\taner\appdata\local\programs\python\python38\lib\site-packages\folium\folium.py:411: FutureWarning: The choropleth method has been deprecated. Instead use the new Choropleth class, which has the same arguments. See the example notebook 'GeoJSON_and_choropleth' for how to do this. warnings.warn(
I will look for correlation between price and other factors in this section.
I will compare and cluster region in this section.